Entropy scanning fixups #319

rbailey-godaddy · 2022-01-11T23:27:59Z

To help us get this pull request reviewed and merged quickly, please be sure to include the following items:

Tests (if applicable)
Documentation (if applicable)
Changelog entry
A full explanation here in the PR description of the work done

PR Type

What kind of change does this PR introduce?

Backward Compatibility

Is this change backward compatible with the most recently released version? Does it introduce changes which might change the user experience in any way? Does it alter the API in any way?

Yes (backward compatible)
No (breaking changes)

"Almost" -- see description below for corner cases.

Issue Linking

closes #315

What's new?

The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned out to be problematic because we several examples of things that looked like base64 encodings but weren't and which generated new issues thanks to the extended combined alphabet. Simple tweaks looked problematic (read "nondeterministic") going forward.

This fix is a little bit more invasive.

We scan for BASE64 and BASE64URL encodings separately (using the same flakey expressions as pre-3.0.0 did) so they cannot bleed into each other. The cost of this is that we now make 3 (vs 2) regex scans of input text. A number of unit tests required revision to account for the additional calls.
To offset this cost, we further adjust the regex expressions to match a minimum of 20 characters, offloading much of the work from function util.find_strings_by_regex() which spent most of its time throwing away short strings. Nothing in the tartufo codebase allows users to supply a different minimum length; I added a warning for folks who might be building other code on top of tartufo.
Additionally, scanner.scan_entropy() used to split lines into words and scan the words individually. Since the supplied regexs will do this by themselves anyway, we eliminate the line.split() step and eliminate a pass over the input text. This required fixes to a number of unit tests where we generally make fewer calls on longer targets.
But wait! There's more! Many strings are both base64 and base64url, and we don't want to generate duplicate (or overlapping) issues. That resulted in the following two changes...
scanner.evaluate_entropy_string() now returns an optional issue instead of being a generator (that returned either 1 or 0 issues). This allows its only caller to inspect the return before passing it back to higher-level callers.
scanner.scan_entropy() now jumps through deduplication hoops:
- results from both base64 and base64url scans are combined in a set to eliminate strict duplicates; as a side effect, it is possible that lines with multiple issues (in the same line) may return the issues in a different order.
- target strings are evaluated from longest to shortest, and we no longer report an issue for a substring if we already reported an issue on the longer string containing it. We do not strictly test that the substring is located within the longer string, so a line that contained "bad-string even-more-bad-string" would not generate an issue for the initial "bad-string" even though it appears before the subsequent "even-more-bad-string" (which would get an issue).
- We also test for duplicate hex encoding issues (because a hex string is also a base64 string) -- this could not happen with default sensitivity (backstory omitted) but is nice to close off.

My testing shows variations in the noise level when tartufo scans itself, but it seems like the extra overhead should be approximately balanced by the efficiency gains and we'll see a wash. Time will tell.

Note it is still possible to get new issues (not reported by 2.x) if a string begins or ends with - or _ and the old string didn't generate an issue but the new longer string does.

The loosey-goosey "base64-anything" matching introduced in 3.0.0 turned out to be problematic because we several examples of things that looked like base64 encodings but weren't and which generated new issues thanks to the extended combined alphabet. Simple tweaks looked problematic (read "nondeterministic") going forward. This fix is a little bit more invasive. * We scan for BASE64 and BASE64URL encodings separately (using the same flakey expressions as pre-3.0.0 did) so they cannot bleed into each other. The cost of this is that we now make 3 (vs 2) regex scans of input text. A number of unit tests required revision to account for the additional calls. * To offset this cost, we further adjust the regex expressions to match a minimum of 20 characters, offloading much of the work from function `util.find_strings_by_regex()` which spent most of its time throwing away short strings. Nothing in the tartufo codebase allows users to supply a different minimum length; I added a warning for folks who might be building other code on top of tartufo. * Additionally, `scanner.scan_entropy()` used to split lines into words and scan the words individually. Since the supplied regexs will do this by themselves anyway, we eliminate the `line.split()` step and eliminate a pass over the input text. This required fixes to a number of unit tests where we generally make fewer calls on longer targets. * But wait! There's more! Many strings are both base64 and base64url, and we don't want to generate duplicate (or overlapping) issues. That resulted in the following two changes... * `scanner.evaluate_entropy_string()` now returns an optional issue instead of being a generator (that returned either 1 or 0 issues). This allows its only caller to inspect the return before passing it back to higher-level callers. * `scanner.scan_entropy()` now jumps through deduplication hoops: - results from both base64 and base64url scans are combined in a set to eliminate strict duplicates; as a side effect, it is possible that lines with multiple issues (in the same line) may return the issues in a different order. - target strings are evaluated from longest to shortest, and we no longer report an issue for a substring if we already reported an issue on the longer string containing it. We do not strictly test that the substring is located within the longer string, so a line that contained "bad-string even-more-bad-string" would not generate an issue for the initial "bad-string" even though it appears before the subsequent "even-more-bad-string" (which would get an issue). - We also test for duplicate hex encoding issues (because a hex string is also a base64 string) -- this could not happen with default sensitivity (backstory omitted) but is nice to close off. My testing shows variations in the noise level when tartufo scans itself, but it seems like the extra overhead should be approximately balanced by the efficiency gains and we'll see a wash. Time will tell. Note it is *still* possible to get new issues (not reported by 2.x) if a string begins or ends with `-` or `_` and the old string didn't generate an issue but the new longer string does.